1.  Introduction


Object recognition in cluttered environments is a difficult problem with widespread
applications. Most approaches to object recognition, including the one presented here, rely
on the algorithm first finding correspondences between model features and image features,
then computing a hypothesized model pose, and finally searching for additional image
features that support this pose. The most challenging part of this process is the
identification of corresponding features when the images are affected by clutter, partial
object occlusion, changes in illumination, and changes in viewpoint. In fact, once the
feature correspondence problem is solved, object recognition becomes almost trivial. A
wide variety of features has been employed by object recognition systems, including points,
edges, and textured regions. There are advantages and disadvantages to each type of
feature, and each is suitable for different applications.

The surfaces of many objects consist of regions of uniform color or texture. Most of the
information available for object recognition is at the boundaries (edges) of these regions; a
line drawing representation of these objects provides a nearly complete description of
them. Approaches to object recognition that rely on variations in texture inside these
regions are likely to perform poorly. Furthermore, objects such as bicycles, chairs, and
ladders that are composed of thin, stick-like components are especially difficult for
texture-based approaches because background clutter will be present within a few pixels of
any object pixel, thus corrupting local texture templates (4). Methods that rely on the
boundary shapes are better suited to these types of objects. This report presents a simple,
effective, and fast method for recognizing partially occluded 2D (two-dimensional) objects
in cluttered environments, where the object models and their images are each described by
sets of line segments. A fair amount of perspective distortion is tolerated by the algorithm,
so the algorithm is also applicable to 3D (three-dimensional) objects that are represented
by sets of viewpoint-dependent 2D models. Because of the close ties between object
recognition and feature correspondence, this report is about a new feature correspondence
algorithm as much as it is about a new object recognition algorithm.

Our approach assumes that at least one model line is detected as an unfragmented line in
the image. By unfragmented, we mean that the corresponding image line is extracted from
the image as a single continuous segment between the two end points of the projected
model line. This necessarily requires that at least one model line be unoccluded.

Additional model lines must be present in the image for verification, but these may be
partially occluded or fragmented. A potential difficulty with this approach is that line
detection algorithms often fragment lines because of difficulties in parameter selection, and
they usually do not extract lines completely at the intersections with other lines. The issue
of fragmentation resulting from poor parameter selection can be ameliorated through
post-processing steps that combine nearby collinear lines. However, this has not been
necessary in any of our experiments. The inability of line detection algorithms to
accurately locate the end points of lines at the intersections with other lines does not cause a
problem because a few missing pixels at the ends of a line do not significantly affect the
computed model transformations (except in the case when the object’s image is so small as
to make recognition difficult, regardless of how well the object’s edges are detected). We
show that our line detector is able to detect a large number of object lines with very little
relative error in their length when compared to the corresponding projected model lines.

A three-stage process is used to locate objects. In the first stage, a list of approximate
model pose hypotheses is generated. Every pairing of a model line to an image line first
contributes a pose hypothesis consisting of a similarity transformation. When both the
model line and the corresponding image line form corner-like structures with other nearby
lines and the angles of the corners are similar (within 45 degrees), a pose hypothesis
consisting of an affine transformation is added to the hypothesis list, one for each such
compatible corner correspondence. Typically, each model-to-image line correspondence
contributes a small number of poses (one to six) to the hypothesis list.

We employ information inherent in a single line correspondence (position, orientation, and
scale) to reduce the number of correspondences that must be examined in order to find an
approximately correct pose. For $m$ model lines and $n$ image lines, we generate $O(mn)$
approximate pose hypotheses. Compare this to traditional algorithms that generate precise
poses from three pairs of correspondences, where there are as many as $O(m^3 n^3)$ pose
hypotheses. An approach such as random sample consensus (RANSAC) (5), which
examines a very small fraction of these hypotheses, still has to examine O(n^) poses to
ensure with probability 0.99 that a correct precise pose will be found (h). By starting with
an approximate pose instead of a precise pose, we are able to greatly reduce the number of
poses that need to be examined and ultimately find a correct precise pose.
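
As a rough illustration of the gap between one-correspondence and three-correspondence hypothesis counts, consider hypothetical sizes of m = 20 model lines and n = 500 image lines. These values are chosen only to make the arithmetic concrete and are not taken from the experiments.

```latex
\begin{align*}
mn &= 20 \times 500 = 10^{4}, \\
m^{3} n^{3} &= (20 \times 500)^{3} = 10^{12}.
\end{align*}
```

Even if every line correspondence contributed six poses, the hypothesis list for these assumed sizes would hold 6 x 10^4 entries, far fewer than the 10^12 triples of correspondences.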

Most  of  the  pose  hypotheses  will  be  inaccnrate  because  most  of  the  generating 
correspondences  are  incorrect.  The  second  stage  of  our  approach  ranks  each  pose 
hypothesis  based  on  the  similarity  of  the  corresponding  local  neighborhoods  of  lines  in  the 
model  and  image.  The  new  similarity  measure  is  largely  unaffected  by  image  clutter, 
partial  occlusion,  and  fragmentation  of  lines.  Nearest  neighbor  search  is  used  in  order  to 
compute  the  similarity  measure  quickly  for  many  pose  hypotheses.  Because  this  similarity 
measure  is  computed  as  a  function  of  approximate  pose,  the  ranking  of  the  pose  hypotheses 
is  invariant  to  image  translation,  scaling,  rotation,  and  partially  invariant  to  affine 
distortion  of  the  image.  By  combining  the  process  of  pose  hypothesis  generation  from 
assumed  unfragmented  image  lines  with  the  neighborhood  similarity  measure,  we  are  able 
to  quickly  generate  a  ranked  list  of  approximate  model  poses  which  is  likely  to  include  a 
number of highly ranked poses that are close to the correct model pose.

The final stage of the approach applies a more time-consuming but also more accurate pose
refinement and verification algorithm to a few of the most highly ranked approximate
poses. Gold’s graduated assignment algorithm (10,11), modified for line correspondences,
is used for this purpose because it is efficient, tolerant of clutter and occlusion, and does
not make binary correspondence decisions until an optimal pose is found.

Our  three-stage  approach  allows  CPU  resources  to  be  quickly  focused  on  the  highest  payoff 
pose  hypotheses,  which  in  turn  results  in  a  large  reduction  in  the  amount  of  time  needed  to 
perform  object  recognition.  An  outline  of  the  algorithm  is  shown  in  figure  1.  In  the 
following  sections,  we  first  describe  related  work  and  then  describe  each  step  of  our 
algorithm  in  more  detail.  Although  any  line  detection  algorithm  may  be  used,  appendix  A 
briefly  discusses  the  line  detection  algorithm  that  we  use  and  presents  an  evaluation  of  its 
ability  to  extract  unfragmented  lines  from  an  image.  Section  3  then  shows  how  approximate 
pose  hypotheses  are  generated  from  a  minimal  number  of  line  correspondences.  Next,  in 
sections  4  and  5,  we  present  our  method  for  efficiently  comparing  local  neighborhoods  of 
model  lines  to  local  neighborhoods  of  image  lines.  Section  6  describes  the  pose  refinement 
and  verification  algorithm  that  we  use.  Experiments  with  real  imagery  containing  high 
levels  of  clutter  and  occlusion  (see  figure  2,  for  example)  are  discussed  in  section  7  and 
demonstrate  the  effectiveness  of  the  algorithm;  this  section  also  gives  the  run-time 
complexity  of  the  algorithm.  We  see  that  our  algorithm  is  faster  and  able  to  handle  greater 
amounts  of  clutter  than  previous  approaches  that  use  line  features.  The  approach  is  able  to 
recognize  planar  objects  that  are  rotated  by  as  much  as  60  degrees  away  from  their 
modeled  viewpoint  and  recognize  3-D  objects  from  2-D  models  that  are  rotated  by  as  much 
as  30  degrees  from  their  modeled  viewpoint.  The  report  ends  with  conclusions  in  section  8. 
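
The three-stage flow outlined in figure 1 can be sketched in a few lines of Python. This is only an illustrative skeleton for a single model: the callables `generate_poses`, `score`, and `refine`, and the threshold `min_matches`, are hypothetical placeholders for the components described in sections 3 through 6, and the toy values at the end exist only to exercise the control flow.

```python
def recognize(model_lines, image_lines, generate_poses, score, refine,
              num_refinements=4, min_matches=3):
    """Generate pose hypotheses, rank them, and refine only the best few.

    generate_poses(model_line, image_line) -> iterable of approximate poses
    score(pose, model_line) -> dissimilarity (smaller means more similar)
    refine(pose) -> (refined_pose, list_of_line_matches)
    min_matches is an assumed stand-in for "a sufficient number" of matches.
    """
    hypotheses = []
    # Stages 1 and 2: every model/image line pair contributes scored poses.
    for model_line in model_lines:
        for image_line in image_lines:
            for pose in generate_poses(model_line, image_line):
                hypotheses.append((score(pose, model_line), pose))
    # Rank so that the most similar neighborhoods come first.
    hypotheses.sort(key=lambda item: item[0])
    # Stage 3: run the expensive refinement on the top-ranked poses only.
    for _, pose in hypotheses[:num_refinements]:
        refined_pose, matches = refine(pose)
        if len(matches) >= min_matches:
            return refined_pose, matches
    return None


# Toy stand-ins: "lines" and "poses" are plain integers here.
toy_result = recognize(
    model_lines=[1, 2],
    image_lines=[3, 4, 5],
    generate_poses=lambda l, l_img: [l + l_img, l * l_img],
    score=lambda pose, l: abs(pose - 7),
    refine=lambda pose: (pose, ["match"] * 5 if pose == 7 else []),
)
```

Unlike figure 1, this sketch stops at the first accepted pose; collecting every accepted pose instead would be a small change to the final loop.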


2.  Related  Work 


Automatic  registration  of  models  to  images  is  a  fundamental  and  open  problem  in 
computer  vision.  Applications  include  object  recognition,  object  tracking,  site  inspection 
and  updating,  and  autonomous  navigation  when  scene  models  are  available.  It  is  a  difficult 
problem  because  it  comprises  two  coupled  problems,  the  correspondence  problem  and  the 
pose  problem,  each  easy  to  solve  only  if  the  other  has  been  solved  first. 

A wide variety of approaches to object recognition has been proposed since Roberts’
ground-breaking work on recognizing 3-D polyhedral objects from 2-D perspective images
(19). Among the pioneering contributions are Fischler and Bolles’ RANSAC method (9),
Baird’s tree-pruning method (2), and Ullman’s alignment method (22). These approaches,
which hypothesize poses from small sets of correspondences and reject or accept those
poses based on the presence of supporting correspondences, become intractable when the
number of model and image features becomes large, especially when the image contains
significant clutter.

    Create a data structure for nearest neighbor and range searches of image lines.
    Using a range search, identify corners in each model and in the image.
    for each model do
        H = ∅.    // Initialize hypothesis list to empty.
        for each pair of model line l and image line l' do
            C = Pose hypotheses generated from l, l', and nearby corners.
            H = H ∪ C.
            Evaluate the similarity of model and image neighborhoods for poses C.
        end for
        P = Sort H based on neighborhood similarity measure.
        for i = 1 to N do
            Apply the graduated assignment algorithm starting from pose P(i).
            if a sufficient number of line correspondences are found then
                An object has been recognized.
            end if
        end for
    end for

Figure 1. Outline of the new object recognition algorithm. (The constant N is the number of pose
refinements performed; as discussed in section 7, good performance is obtained with N = 4.)

[Figure 2 panels: (a) Models; (b) Test image; (d) Recognized books]

Figure 2. Recognizing books in a pile. (The two models were generated from frontal images of the
books.)

More recently, the use of rich feature descriptors has become popular as a way of reducing
the number of feature correspondences that must be examined. The Harris corner detector
(13) has seen widespread use for this purpose; however, it is not stable to changes in image
scale, so it performs poorly when models and images of different scales are matched.

Schmid and Mohr (20) have developed a rotationally invariant feature descriptor using the
Harris corner detector. Lowe (16) extended this work to scale invariant and partially affine
invariant features with his scale invariant feature transform (SIFT) approach, which
uses scale-space methods to determine the location, scale, and orientation of features and
then computes, relative to these parameters, a gradient orientation histogram describing
the local texture. Excellent results have been obtained by approaches with these rich
features when objects have significant distinctive texture. However, there are many common
objects that possess too little distinctive texture for these methods to be successful.
Examples include thin objects such as bicycles and ladders, where background clutter will be
present near all object boundaries, and uniformly textured objects such as upholstered
furniture. In these cases, only the relations between geometric features (such as points and
edges) can be used for matching and object recognition. Edges are sometimes preferred to
points because they are easy to locate and are stable features on textured and nontextured
objects.

Our approach has some similarities to Ayache and Faugeras’s hypotheses predicted and
evaluated recursively (HYPER) system (1). They use a tree-pruning algorithm to
determine 2-D similarity transformations that best align 2-D object models with images,
where both the models and images are represented by sets of line segments. The ten
longest lines in the model are identified as “privileged” segments, which are used for initial
hypothesis generation because there are fewer of them (so fewer hypotheses have to be
generated) and because the use of long segments results in more accurate pose estimates.
The authors point out that the probability of having all privileged segments simultaneously
occluded is very small, and only one privileged segment needs to be visible to identify
a model. Although this is true, we believe that long model lines are just as likely as short
model lines to be fragmented in an image, and therefore, we treat all model lines identically
and do not identify any as privileged. In the HYPER system, 2-D pose hypotheses are
generated by matching each privileged model line to every compatible image
line, where compatibility is defined in terms of the difference in the angles of corners
formed with neighbors and the difference in scale from an a priori scale. The hypotheses
are ranked based on the degree of compatibility of the matched segments, and then the
best hypotheses are refined with a tree search to locate additional matching model and
image segments that are compatible with the initial pose estimate. During this tree search,
a pose hypothesis is augmented with additional matches when the difference in orientation
of the two segments, the Euclidean distance between their midpoints, and the relative
difference between their lengths are all small. In contrast, our pose hypotheses are based on
affine transformations instead of similarity transformations; we use a dissimilarity measure
(see section 5) to rank hypotheses, which is less affected by line fragmentation because it
does not depend on the lengths of lines nor on unique reference points on the lines; and we
use the more robust and efficient graduated assignment algorithm (10) for pose refinement.

A number of more recent works (18,4) have also used edges for object recognition of poorly
textured objects. Mikolajczyk et al. (18) generalize Lowe’s SIFT descriptors to edge
images, where the position and orientation of edges are used to create local shape
descriptors that are orientation and scale invariant. Carmichael’s approach (4) uses a
cascade of classifiers of increasing aperture size, trained to recognize local edge
configurations, to discriminate between object edges and clutter edges; this method
requires many training images to learn object shapes, and it is not invariant to changes in
image rotation or scale.

Gold and Rangarajan (11) simultaneously compute pose and 2D-to-2D or 3D-to-3D point
correspondences using deterministic annealing to minimize a global objective function. We
previously used this method (5) for matching 3-D model lines to 2-D image lines, and we
use it here for the pose refinement stage of our algorithm. Beveridge (<?) matches points and
lines using a random start local search algorithm. Whitley et al. (23) present an algorithm
for 2-D pose estimation that uses a spatial heuristic similar to our corner correspondences
to initialize a messy genetic algorithm and then uses Beveridge’s local search algorithm
to refine individuals in the population. Denton and Beveridge (7) extended Beveridge’s
original work by replacing random starts with a heuristic that selects the initial
correspondence sets to which the local search algorithm is applied. Although we use line
features instead of point features, Denton’s approach is conceptually similar to ours in a
number of ways. Both approaches first hypothesize poses using small sets of local
correspondences, then sort the hypotheses based on a local match error, and finally apply a
pose refinement and verification algorithm to a small number of the best hypotheses.
Significant differences between the two approaches are that our approach uses lines instead
of points, and zero or one neighboring features instead of four to generate pose hypotheses;
thus, our approach will have many fewer hypotheses to consider, and each hypothesis is much
less likely to be corrupted by spurious features (clutter).


3.  Generating  Pose  Hypotheses 


We wish to generate a small set of approximate poses that, with high certainty, includes at
least one pose that is close to the true pose of the object. The smaller the number of
correspondences used in estimating a pose, the less likely the estimated pose will be
corrupted by spurious correspondences. At the same time, however, using fewer
correspondences will produce a less accurate pose when all correspondences used by the
estimation are correct. From a single correspondence of a model line to an image line,
where the image line may be fragmented (only partially detected because of partial
occlusion or faulty line detection), we can compute the 2-D orientation of the model as well
as a one-dimensional constraint on its position, but the scale and translation of the model
cannot be determined; this does not provide sufficient geometric constraints to evaluate the
similarity of a local region of the model with a local region of the image.

On  the  other  hand,  if  we  assume  that  a  particular  image  line  is  unfragmented,  then  from  a 
single  correspondence  of  a  model  line  to  this  image  line,  we  can  compute  a  2-D  similarity 
transformation  of  the  model.  This  is  possible  because  the  two  end  points  of  the 
unfragmented  image  line  must  correspond  to  the  two  end  points  of  the  model  line,  and  two 
corresponding  points  are  sufficient  to  compute  a  similarity  transformation.  A  similarity 
transformation  will  be  accurate  when  the  viewing  direction  used  to  generate  the  2-D  model 
is  close  to  the  viewing  direction  of  the  object.  However,  even  when  there  is  some 
perspective  distortion  present,  approximate  similarity  transformations  from  correct 
correspondences  are  often  highly  ranked  by  the  next  stage  of  our  approach.  Generating 
hypothesized poses that are highly ranked in the next stage is the main goal of this first
stage since the pose refinement algorithm used in the final stage has a fairly large region of
convergence. 


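Because we do not know which end point of the model line corresponds to which end point
of the image line, we consider both possibilities and generate a similarity transformation
for each. For model line end points $\mathbf{p}_1$ and $\mathbf{p}_2$ corresponding to image
line end points $\mathbf{q}_1$ and $\mathbf{q}_2$, respectively, the similarity transformation
mapping the model to the image is $\mathbf{q}_i = A\mathbf{p}_i + \mathbf{t}$, where $A = sR$
and $s$, $R$, and $\mathbf{t}$ are the scaling, rotation, and translation, respectively,
defined by

$$
\begin{aligned}
s &= \|\mathbf{q}_1 - \mathbf{q}_2\| \,/\, \|\mathbf{p}_1 - \mathbf{p}_2\|, \\
R &= \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \\
\mathbf{t} &= \mathbf{q}_1 - A\mathbf{p}_1,
\end{aligned}
$$

and where $\theta$ is the rotation angle (in the range $-\pi$ to $\pi$, clockwise being
positive) from $\mathbf{p}_1 - \mathbf{p}_2$ to $\mathbf{q}_1 - \mathbf{q}_2$.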
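
A minimal Python sketch of this computation follows, assuming points are given as (x, y) tuples. The function names `similarity_from_line` and `apply_pose` are illustrative and do not come from the report. The sign of the angle follows the matrix R as written above; whether a positive angle looks clockwise depends on whether the image y axis points down.

```python
import math


def similarity_from_line(p1, p2, q1, q2):
    """Return (s, theta, A, t) such that q_i = A p_i + t for i = 1, 2."""
    dpx, dpy = p1[0] - p2[0], p1[1] - p2[1]
    dqx, dqy = q1[0] - q2[0], q1[1] - q2[1]
    # Scale is the ratio of image line length to model line length.
    s = math.hypot(dqx, dqy) / math.hypot(dpx, dpy)
    # Rotation angle from p1 - p2 to q1 - q2, wrapped into (-pi, pi].
    theta = math.atan2(dqy, dqx) - math.atan2(dpy, dpx)
    theta = math.atan2(math.sin(theta), math.cos(theta))
    c, sn = s * math.cos(theta), s * math.sin(theta)
    A = ((c, -sn), (sn, c))                      # A = s R
    t = (q1[0] - (A[0][0] * p1[0] + A[0][1] * p1[1]),
         q1[1] - (A[1][0] * p1[0] + A[1][1] * p1[1]))
    return s, theta, A, t


def apply_pose(A, t, p):
    """Map a model point p into the image with the pose {A, t}."""
    return (A[0][0] * p[0] + A[0][1] * p[1] + t[0],
            A[1][0] * p[0] + A[1][1] * p[1] + t[1])


def both_similarity_hypotheses(p1, p2, q1, q2):
    """One hypothesis for each way the end points can be paired."""
    return [similarity_from_line(p1, p2, q1, q2),
            similarity_from_line(p1, p2, q2, q1)]


# Example: a unit-length model line mapped onto a vertical image line of length 2.
sim_s, sim_theta, sim_A, sim_t = similarity_from_line((0, 0), (1, 0), (2, 3), (2, 5))
sim_mapped = apply_pose(sim_A, sim_t, (1, 0))
```

By construction, the first end point maps exactly onto its image counterpart, and the second does so up to floating-point rounding.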


We  can  obtain  more  accurate  approximate  poses  with  little  additional  work  when  the 
model  line  and  the  unfragmented  image  line  (called  the  base  lines)  form  corner-like 
structures  with  other  lines:  corners  in  the  model  should  correspond  to  corners  in  the  image. 
Corners  in  the  model  are  formed  by  pairs  of  model  lines  that  terminate  at  a  common  point, 
while  corners  in  the  image  are  formed  by  pairs  of  image  lines  that  terminate  within  a  few 
pixels of each other. By looking at corners, we expand our search to correspondences of
two line pairs. However, because we restrict the search for corner structures in the image to
lines that terminate within a few pixels of an end point of a base image line, the number of
corners examined for any base image line is usually quite small. As before, we assume only
that the base image line is unfragmented; other image lines may be fragmented. If a base
model line forms a corner with another model line, which is usually the case for objects
described by straight edges, and if the base image line is unfragmented, then all model lines
that share an end point with the base model line should be unoccluded around that end
point in the image, and therefore, there is a good chance that these other model lines will
appear in the image near the corresponding end point of the base image line. Thus, looking
at corners formed with the base image lines provides a way of finding additional line
correspondences with a low outlier rate.

The  model  and  image  lines  that  participate  in  corner  structures  are  efficiently  located  with 
a range search algorithm (17). The end points of all image lines are first inserted into a
search  tree  data  structure.  Then,  for  each  end  point  of  each  image  line,  a  range  search  is 
performed  to  locate  nearby  end  points  and  their  associated  lines.  A  similar  process  is 
performed  for  the  model  lines.  This  pre-processing  step  is  done  once  for  each  model  and 
image.  To  generate  pose  hypotheses  for  a  particular  base  correspondence,  the  angles  of 
corners  formed  with  the  base  model  line  are  compared  to  the  angles  of  corners  formed  with 
the  base  image  line.  An  affine  pose  hypothesis  is  generated  for  any  pair  of  corner  angles 
that  are  within  45  degrees.  As  before,  this  is  repeated  for  each  of  the  two  ways  that  the 
base  model  line  can  correspond  to  the  base  image  line.  Note  that  these  affine  pose 
hypotheses  are  generated  in  addition  to  the  similarity  pose  hypotheses  described  before. 
The  similarity  pose  hypotheses  are  kept  even  though  they  may  be  less  accurate  because  the 
affine  pose  hypotheses  are  more  susceptible  to  being  corrupted  by  spurious  correspondences. 
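
The corner search and the 45-degree compatibility test can be illustrated with the following sketch. It swaps the search tree mentioned above for a simple uniform grid over end points, which is an assumption made here for brevity rather than the structure used in the report. The function names, the signed-angle convention, and the pixel radius are likewise illustrative.

```python
import math
from collections import defaultdict


def endpoint_grid(lines, radius):
    """Bucket every line end point into a grid whose cell size equals radius."""
    grid = defaultdict(list)
    for i, line in enumerate(lines):
        for e, (x, y) in enumerate(line):
            grid[(math.floor(x / radius), math.floor(y / radius))].append((i, e))
    return grid


def corner_angle(base, e, other, f):
    """Signed angle in degrees between two lines, each directed away from the corner."""
    bx, by = base[1 - e][0] - base[e][0], base[1 - e][1] - base[e][1]
    ox, oy = other[1 - f][0] - other[f][0], other[1 - f][1] - other[f][1]
    return math.degrees(math.atan2(bx * oy - by * ox, bx * ox + by * oy))


def corners_at(lines, grid, radius, i, e):
    """List (j, f, angle) for lines j with end point f within radius of end e of line i."""
    x, y = lines[i][e]
    cx, cy = math.floor(x / radius), math.floor(y / radius)
    found = []
    # Any point within radius must lie in one of the 3 x 3 surrounding cells.
    for gx in (cx - 1, cx, cx + 1):
        for gy in (cy - 1, cy, cy + 1):
            for j, f in grid.get((gx, gy), ()):
                if j != i and math.dist(lines[j][f], (x, y)) <= radius:
                    found.append((j, f, corner_angle(lines[i], e, lines[j], f)))
    return found


def compatible(angle_model, angle_image, tol=45.0):
    """True when two corner angles differ by at most tol degrees, with wraparound."""
    diff = (angle_model - angle_image + 180.0) % 360.0 - 180.0
    return abs(diff) <= tol


# Example: line 1 ends one pixel from the start of line 0; line 2 is far away.
demo_lines = [((0, 0), (10, 0)), ((0, 1), (0, 10)), ((50, 50), (60, 50))]
demo_grid = endpoint_grid(demo_lines, radius=2.0)
demo_corners = corners_at(demo_lines, demo_grid, 2.0, 0, 0)
```

In this arrangement, each pair of a model corner and an image corner for which `compatible` returns true would contribute one affine pose hypothesis.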

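An affine pose hypothesis is generated as follows. Let $\mathbf{p}_1$ and $\mathbf{p}_2$ be
the end points of the base model line and $\mathbf{q}_1$ and $\mathbf{q}_2$ be the
corresponding end points of the base image line (see figure 3). Assume that a pair of corners
is formed with the base lines by model line $p$ and image line $q$ that terminate near end
points $\mathbf{p}_1$ and $\mathbf{q}_1$ and have angles $\theta_p$ and $\theta_q$,
respectively. We have two pairs of corresponding points and one pair of corresponding
angles. Since a 2-D affine transformation has 6 degrees of freedom but we have only five
constraints (two for each point correspondence and one for the angle correspondence), we
impose the additional constraint that the affine transformation must scale the length of line
$p$ in the same way as it does the length of the base model line $\mathbf{p}_1\mathbf{p}_2$.
This defines a third pair of corresponding points $\mathbf{p}_3$ and $\mathbf{q}_3$ on $p$
and $q$, respectively, as shown in figure 3. $\mathbf{p}_3$ is the second end point of $p$,
and $\mathbf{q}_3$ is the point collinear with $q$ so that

$$
\frac{\|\mathbf{q}_3 - \mathbf{q}_1\|}{\|\mathbf{p}_3 - \mathbf{p}_1\|}
= \frac{\|\mathbf{q}_2 - \mathbf{q}_1\|}{\|\mathbf{p}_2 - \mathbf{p}_1\|}.
$$

$\mathbf{q}_3$ is found to be

$$
\mathbf{q}_3 = \mathbf{q}_1
+ \frac{\|\mathbf{p}_3 - \mathbf{p}_1\| \, \|\mathbf{q}_2 - \mathbf{q}_1\|}{\|\mathbf{p}_2 - \mathbf{p}_1\|}
\begin{bmatrix} \cos\theta_q & -\sin\theta_q \\ \sin\theta_q & \cos\theta_q \end{bmatrix}
\frac{\mathbf{q}_2 - \mathbf{q}_1}{\|\mathbf{q}_2 - \mathbf{q}_1\|}.
$$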

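The affine transformation mapping a model point $\mathbf{p}_i = [\, p_{ix} \;\; p_{iy} \,]^T$
to the image point $\mathbf{q}_i = [\, q_{ix} \;\; q_{iy} \,]^T$ is

$$
\begin{bmatrix} q_{ix} \\ q_{iy} \end{bmatrix}
= \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix}
\begin{bmatrix} p_{ix} \\ p_{iy} \end{bmatrix}
+ \begin{bmatrix} t_x \\ t_y \end{bmatrix}.
\tag{1}
$$

For each correspondence $\mathbf{p}_i \leftrightarrow \mathbf{q}_i$ we have two linear
equations in the six unknowns $a_1$, $a_2$, $a_3$, $a_4$, $t_x$, and $t_y$. From the three
corresponding points, we can solve for the parameters of the affine transformation. Figure 4
shows the pose hypotheses generated for a particular correct base correspondence.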
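
The six unknowns of equation 1 separate into two 3 x 3 linear systems that share one coefficient matrix, so they can be solved directly. The sketch below does this with Cramer's rule. It also places the third image point under the assumption that the image corner angle is measured from the direction q2 - q1 to the corner line; figure 3 fixes the actual convention, so that choice and all function names here are illustrative only.

```python
import math


def third_image_point(p1, p2, p3, q1, q2, theta_q):
    """Point on the image corner line whose distance from q1 is s * |p3 - p1|.

    Assumes theta_q (radians) rotates the direction q2 - q1 onto the corner line.
    """
    ratio = math.dist(p3, p1) / math.dist(p2, p1)
    c, s = math.cos(theta_q), math.sin(theta_q)
    dx, dy = q2[0] - q1[0], q2[1] - q1[1]
    return (q1[0] + ratio * (c * dx - s * dy),
            q1[1] + ratio * (s * dx + c * dy))


def affine_from_three_points(ps, qs):
    """Solve q_i = A p_i + t from three correspondences; returns (A, t)."""
    (x1, y1), (x2, y2), (x3, y3) = ps
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("model points are collinear; affine pose is undetermined")

    def solve(b1, b2, b3):
        # Cramer's rule for the system [x_i  y_i  1] [a  b  t]^T = b_i.
        a = (b1 * (y2 - y3) - y1 * (b2 - b3) + (b2 * y3 - b3 * y2)) / det
        b = (x1 * (b2 - b3) - b1 * (x2 - x3) + (x2 * b3 - x3 * b2)) / det
        t = (x1 * (y2 * b3 - y3 * b2) - y1 * (x2 * b3 - x3 * b2)
             + b1 * (x2 * y3 - x3 * y2)) / det
        return a, b, t

    a1, a2, tx = solve(qs[0][0], qs[1][0], qs[2][0])   # first row of equation 1
    a3, a4, ty = solve(qs[0][1], qs[1][1], qs[2][1])   # second row of equation 1
    return ((a1, a2), (a3, a4)), (tx, ty)


# Example: recover a known affine map A = [[2, 1], [0, 3]], t = (5, -1).
demo_A, demo_t = affine_from_three_points(
    [(0, 0), (1, 0), (0, 1)], [(5, -1), (7, -1), (6, 2)])
demo_q3 = third_image_point((0, 0), (2, 0), (0, 1), (0, 0), (4, 0), math.pi / 2)
```

The collinearity check matters in practice: if the corner angle in the model is close to 0 or 180 degrees, the three model points are nearly collinear and the solved transformation is poorly conditioned.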


4.  Similarity  of  Line  Neighborhoods 


The second stage of the recognition algorithm ranks all hypothesized approximate model
poses in the order that the pose refinement and verification algorithm should examine
them; the goal is to rank highly those poses that are most likely to lead the refinement and
verification algorithm to a correct precise pose. This way, the final stage can examine the
smallest number of approximate poses needed to ensure that a correct pose will be found if
an object is present. For this purpose, a geometric measure of the similarity between the
model (transformed by an approximate pose) and the image is computed. To ensure that
this similarity measure can be computed quickly, for any base model line generating a
hypothesized pose, only a local region of model lines surrounding the base line (called the
base model line’s neighborhood) is compared to the image lines. Let $\mathcal{M}$ be the set
of lines for a single model and $\mathcal{I}$ be the set of image lines. We define the
neighborhood radius of a line $l$ to be the smallest distance, denoted $r(l)$, so that the two
end points of at least $N_{\text{nbr}}$ lines (excluding $l$) are within distance $r(l)$ of $l$.
In all our experiments, the value of $N_{\text{nbr}}$ is fixed at 10 lines
($N_{\text{nbr}} < |\mathcal{M}|$). The neighborhood of a model line $l$ is the set of
$N_{\text{nbr}}$ model lines, $\mathcal{N}(l)$, whose end points are within distance $r(l)$
of $l$. Figure 5 illustrates a line and its neighbors.

Figure 4. Pose hypotheses generated for a correct correspondence of a real model and image line.
(The model lines [dashed lines and thick solid line] are shown overlaid on the image lines
[thin lines]. The one thick solid line in each image shows the base correspondence: a model
line perfectly aligned with an image line. The top row shows the two similarity
transformations, one for each possible alignment of the base lines. The bottom row shows the two
affine transformations, one for each possible corner correspondence of the base lines. This
is the complete set of transformations hypothesized for this base correspondence. Notice
the better alignment in the images of the bottom row, resulting from the use of corner angle
correspondences, compared to the upper left image.)


Figure 5. The neighborhood radius of line $l$, in the center of the image, is the minimum distance $r(l)$
for which both end points of $N_{\text{nbr}}$ lines are within distance $r(l)$ of $l$. (Here,
$N_{\text{nbr}} = 5$, but in actual experiments, we take $N_{\text{nbr}} = 10$.)
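
The neighborhood definition can be made concrete with a short sketch. The text does not spell out how the distance from an end point to the line l is measured; the code below assumes the usual point-to-segment distance, and the function names are illustrative. When several lines tie at the radius, the sketch simply keeps the first N_nbr after sorting.

```python
import math


def point_segment_distance(pt, seg):
    """Shortest distance from point pt to the line segment seg."""
    (ax, ay), (bx, by) = seg
    dx, dy = bx - ax, by - ay
    len2 = dx * dx + dy * dy
    if len2 == 0.0:
        return math.dist(pt, (ax, ay))
    # Project pt onto the segment and clamp to its two end points.
    u = ((pt[0] - ax) * dx + (pt[1] - ay) * dy) / len2
    u = max(0.0, min(1.0, u))
    return math.dist(pt, (ax + u * dx, ay + u * dy))


def neighborhood(lines, index, n_nbr=10):
    """Return (r, neighbor_indices) for lines[index].

    r is the smallest distance such that both end points of at least n_nbr
    other lines lie within r of the base line.
    """
    base = lines[index]
    reach = []
    for j, line in enumerate(lines):
        if j != index:
            # A line counts as "within r" only when its farther end point is.
            far = max(point_segment_distance(line[0], base),
                      point_segment_distance(line[1], base))
            reach.append((far, j))
    if len(reach) < n_nbr:
        raise ValueError("need at least n_nbr other lines")
    reach.sort()
    radius = reach[n_nbr - 1][0]
    return radius, [j for _, j in reach[:n_nbr]]


# Example with a small n_nbr so the result is easy to check by hand.
demo_model = [((0, 0), (10, 0)), ((0, 1), (10, 1)), ((0, 3), (5, 3)),
              ((0, 2), (20, 2)), ((2, -2), (3, -2))]
demo_radius, demo_neighbors = neighborhood(demo_model, 0, n_nbr=2)
```

In the example, the line from (0, 2) to (20, 2) passes close to the base line but is excluded because one of its end points lies far beyond it.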

For a hypothesized approximate model pose $\{A, \mathbf{t}\}$ generated for a base model
line $l$, let $T(\mathcal{N}(l), A, \mathbf{t})$ denote the neighbors of $l$ transformed by
the pose $\{A, \mathbf{t}\}$, and let $d(l', l'')$ denote the distance (defined in section 5)
between two lines $l'$ and $l''$ in the image. Then, the geometric similarity between a model
neighborhood $\mathcal{N}$ transformed by the pose $\{A, \mathbf{t}\}$ and the set of image
lines $\mathcal{I}$ is

$$
S(\mathcal{N}, \mathcal{I}, A, \mathbf{t})
= \sum_{l' \in T(\mathcal{N}, A, \mathbf{t})}
\min\Bigl( S_{\max},\; \min_{l'' \in \mathcal{I}} d(l', l'') \Bigr).
\tag{2}
$$

The smaller the value of $S(\mathcal{N}, \mathcal{I}, A, \mathbf{t})$, the more “similar” a
model neighborhood $\mathcal{N}$ is to the image $\mathcal{I}$ under the transformation
$\{A, \mathbf{t}\}$. The parameter $S_{\max}$ ensures that “good” poses are not penalized too
severely when a line in the model is fully occluded in the image. This parameter is easily
set if we observe the values of $S(\mathcal{N}, \mathcal{I}, A, \mathbf{t})$ that are
generated for poor poses (that should be avoided) and then set $S_{\max}$ to this value
divided by $N_{\text{nbr}}$.

As explained in section 5, the distance between a single model neighbor and the closest image
line can be found in time $O(\log n)$ when there are $n$ image lines. Since
$|\mathcal{N}| = N_{\text{nbr}}$ is a fixed constant, the time to compute
$S(\mathcal{N}, \mathcal{I}, A, \mathbf{t})$ is $O(N_{\text{nbr}} \log n) = O(\log n)$.
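
Equation 2 translates almost directly into code. The sketch below takes the line distance as a parameter because the actual distance d is defined in section 5; the `midpoint_distance` used in the example is only a hypothetical stand-in to make the code runnable and is not the measure used in this report. For simplicity the inner minimum is a linear scan over the image lines rather than the logarithmic-time tree search described in the text.

```python
import math


def transform_line(line, A, t):
    """Apply the pose {A, t} to both end points of a line."""
    return tuple((A[0][0] * x + A[0][1] * y + t[0],
                  A[1][0] * x + A[1][1] * y + t[1]) for x, y in line)


def neighborhood_similarity(model_neighbors, image_lines, A, t, distance, s_max):
    """Sum over model neighbors of min(s_max, distance to closest image line)."""
    total = 0.0
    for line in model_neighbors:
        moved = transform_line(line, A, t)
        closest = min(distance(moved, img) for img in image_lines)
        # Capping at s_max limits the penalty for a fully occluded model line.
        total += min(s_max, closest)
    return total


def midpoint_distance(l1, l2):
    """Placeholder distance between lines (NOT the distance of section 5)."""
    m1 = ((l1[0][0] + l1[1][0]) / 2.0, (l1[0][1] + l1[1][1]) / 2.0)
    m2 = ((l2[0][0] + l2[1][0]) / 2.0, (l2[0][1] + l2[1][1]) / 2.0)
    return math.dist(m1, m2)


# Example: identity pose; one neighbor matches exactly, the other has no close match.
identity = ((1.0, 0.0), (0.0, 1.0))
demo_S = neighborhood_similarity(
    model_neighbors=[((0, 0), (2, 0)), ((0, 5), (2, 5))],
    image_lines=[((0, 0), (2, 0)), ((100, 100), (102, 100))],
    A=identity, t=(0.0, 0.0),
    distance=midpoint_distance, s_max=3.0)
```

In the example the unmatched neighbor lies at distance 5 from its nearest image line, but it contributes only the cap of 3 to the sum.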


5.  Distance  Between  Lines 


For any image line $l'$ (which is typically a transformed model line), we wish to efficiently find the line $l'' \in \mathcal{I}$ that minimizes $d(l', l'')$ in equation 2, which expresses the similarity between a model neighborhood and an image neighborhood. This search can be performed efficiently when each line is represented by a point in an $N$-dimensional space and the distance between two lines is the Euclidean distance between the corresponding points in this $N$-dimensional space. Assuming that we have a suitable line representation, a tree data structure storing these $N$-dimensional points can be created in time $O(n \log n)$ and the closest image line can be found in time $O(\log n)$. This tree structure need only be created once for each image and is independent of the model lines.
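
The report does not say which tree structure is used. As one possibility, the sketch below assumes a k-d tree written in plain Python; the function names are hypothetical. For brevity it re-sorts the points at every level, so construction is $O(n \log^2 n)$ rather than the $O(n \log n)$ quoted above, while queries take $O(\log n)$ time on average.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree; each node is a tuple (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])                 # cycle through the coordinates
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid],
            build_kdtree(pts[:mid], depth + 1),
            build_kdtree(pts[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    point, left, right = node
    dist = math.dist(point, query)
    if best is None or dist < best[0]:
        best = (dist, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    if abs(diff) < best[0]:                       # the far side may still hold a closer point
        best = nearest(far, query, depth + 1, best)
    return best
```

The tree is built once per image from the image-line points and then queried with each point that represents a transformed model line.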


Thus, we want to represent each line as a point in an $N$-dimensional space so that the Euclidean distance between two lines is small when the two lines are superposed. We would also like the distance function to be invariant to partial occlusion and fragmentation of lines. Representing a line by its 2-D midpoint is insufficient because two lines can have an identical midpoint but different orientations. We could use the midpoint and orientation of a line, but a short line superposed on a longer line (think of the short line as a partially occluded version of the longer line) could be assigned a large distance because their midpoints may be far apart. Further, there is a problem associated with line orientation because a line with an orientation of $\theta$ should produce the same distance as when its orientation is given as $\theta \pm 2k\pi$ for $k = 1, 2, \ldots$. For example, two lines with identical midpoints but orientations $179$ and $-179$ degrees should produce the same distance as if the orientations of the two lines were $1$ and $-1$ degree. It is not possible with a Euclidean distance function to map both of these pairs of angles to the same distance. A solution to these occlusion and orientation problems is to generate multiple representations of each line.


Let $l$ be a line with orientation $\theta$ (relative to the horizontal, $0 \le \theta < \pi$) and end points $[x_1, y_1]$ and $[x_2, y_2]$. When $l$ is a line in the image ($l \in \mathcal{I}$), $l$ is represented by the two 3-D points

$$\left[ \frac{\theta}{r_\theta}, \frac{x_{\rm mid}}{r_m}, \frac{y_{\rm mid}}{r_m} \right] \quad \text{and} \quad \left[ \frac{\theta - \pi}{r_\theta}, \frac{x_{\rm mid}}{r_m}, \frac{y_{\rm mid}}{r_m} \right], \tag{3}$$

where $[x_{\rm mid}, y_{\rm mid}] = [x_1 + x_2, y_1 + y_2]/2$ is the midpoint of the line and $r_\theta$ and $r_m$ are constant scale factors (described below); these are the 3-D points used to create the data structure that is used in the nearest neighbor searches.

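When $l$ is a transformed model line (as in $l'$ above), $l$ is represented by the set of 3-D points

$$\left\{ \left[ \frac{\theta}{r_\theta}, \frac{x_i}{r_m}, \frac{y_i}{r_m} \right],\ i = 1, 2, \ldots, N_{\rm pts} \right\}, \tag{4}$$

where

$$N_{\rm pts} = \left\lceil \frac{\| [x_2 - x_1, y_2 - y_1] \|}{w} \right\rceil + 1, \tag{5}$$

$$\Delta = \frac{[x_2 - x_1, y_2 - y_1]}{N_{\rm pts} - 1},$$

$$[x_i, y_i] = [x_1, y_1] + (i - 1)\Delta.$$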

In words, two orientations are used for each transformed model line, but the position of the line is represented by a series of $N_{\rm pts}$ points that uniformly sample the length of the line at an interval $w$. The reason why multiple sample points are required to represent the position of transformed model lines but not the image lines is that when $\{A, t\}$ is a correct pose for the model, the image line, which may be partially occluded or otherwise fragmented, will generally be shorter than the transformed model line. In this case, the midpoint of the image line will lie somewhere along the transformed model line, but the midpoint of the transformed model line may lie off the image line. The occlusion problem of the representation is only truly eliminated by a uniform distribution of sample points along transformed model lines when there is a sample point $[x_i, y_i]$ for every pixel on a transformed model line. However, we have found that placing a sample point at approximately every 10th pixel ($w = 10$ in equation 5) along each transformed model line is sufficient to solve this problem. Then, when a transformed model line $l'$ is correctly aligned with a possibly partially occluded image line $l''$, we will have $d(l', l'') < w/r_m$. Figure 6 illustrates how model and image lines are sampled in the process of generating the points in equations 3 and 4.
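
The sampling scheme can be sketched in Python as follows. This is an illustrative reading of equations 3 to 5, not the authors' implementation: the function names are hypothetical, the number of samples is assumed to be rounded up, each model sample carries a single orientation, and the closest pair is found by brute force instead of a tree query. Orientations close to the $0/\pi$ wrap may need extra care that the sketch does not attempt.

```python
import math

def image_line_points(x1, y1, x2, y2, r_theta, r_m):
    """Equation 3: two 3-D points (orientations theta and theta - pi) at the midpoint."""
    theta = math.atan2(y2 - y1, x2 - x1) % math.pi          # orientation in [0, pi)
    xm, ym = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return [(theta / r_theta, xm / r_m, ym / r_m),
            ((theta - math.pi) / r_theta, xm / r_m, ym / r_m)]

def model_line_points(x1, y1, x2, y2, r_theta, r_m, w=10.0):
    """Equation 4: one 3-D point per position sample along a transformed model line."""
    theta = math.atan2(y2 - y1, x2 - x1) % math.pi
    length = math.hypot(x2 - x1, y2 - y1)
    n_pts = max(2, math.ceil(length / w) + 1)               # assumed rounding for equation 5
    dx, dy = (x2 - x1) / (n_pts - 1), (y2 - y1) / (n_pts - 1)
    return [(theta / r_theta, (x1 + i * dx) / r_m, (y1 + i * dy) / r_m)
            for i in range(n_pts)]

def line_distance(model_pts, image_pts):
    """Smallest Euclidean distance between any model sample and any image point."""
    return min(math.dist(p, q) for p in model_pts for q in image_pts)
```

For a 100-pixel model line and an aligned image fragment, the fragment's midpoint is never more than half a sample interval from some model sample, which is why the distance stays below $w/r_m$.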

Figure 6. Sampling the position of model and image lines. (Image line $l$ is represented by its midpoint $[x_{\rm mid}, y_{\rm mid}]$ and its orientation $\theta$. Projected model line $l'$ is represented by the points $\{[x_i, y_i], i = 1, \ldots, N_{\rm pts}\}$ and its orientation $\theta'$. When the pose of a model is accurate and a model line and image line correspond, the midpoint of that image line will be close to some sample point on the projected model line, and the orientation of the two lines will be similar. This is true even when the image line is fragmented. In this example, $[x_{\rm mid}, y_{\rm mid}]$ is closest to $[x_7, y_7]$.)

The scale factors $r_\theta$ and $r_m$ are chosen to normalize the contribution of the orientation and position components to the distance measure. Given a model line $l$ with neighborhood radius $r(l)$ that is mapped to an image line $l'$, the radius around $l'$ in which lines corresponding to neighbors of $l$ are expected to be found is defined to be $r'(l, l') = (\|l'\| \, r(l)) / \|l\|$, which is just the neighborhood radius of $l$ scaled by the same amount that $l$ is itself scaled by the mapping. Assume that a transformed model line and an image line are represented by the points $[\theta_1, x_1, y_1]$ and $[\theta_2, x_2, y_2]$, respectively, as described by equations 3 and 4. When the orientations of the lines differ by $\pi/2$ radians, we want $|\theta_1 - \theta_2| = 1$; when the horizontal distance between the sample points equals the neighborhood radius of the image line, we want $|x_1 - x_2| = 1$; when the vertical distance between the sample points equals the neighborhood radius of the image line, we want $|y_1 - y_2| = 1$. The value $r_\theta = \pi/2$ satisfies the first normalization constraint. However, because image lines will have different neighborhood radii, depending on which model line they correspond to, the latter normalization constraints cannot be satisfied by a constant scale factor, but they are satisfied for model and image lines of average length by

^  IKII)  ^  (0) 

The terms in equation 6 that sum over model lines represent sums over all lines in all models. This value for $r_m$ has worked well in practice.

Finally, for any transformed model line $l'$, to find the image line $l''$ that minimizes $d(l', l'')$ we simply query the nearest neighbor data structure (generated with points from equation 3) with all the points listed in equation 4 and then use the distance of the closest one. Because the complexity of the nearest neighbor search is $O(\log n)$, the use of multiple points to represent lines does not significantly slow down the algorithm.


6.  Graduated  Assignment  for  Lines 


The final stage of the object recognition algorithm is to apply a pose refinement and verification algorithm to the few "best" approximate poses. We use the graduated assignment algorithm (11) for this purpose because it is efficient ($O(mn)$ complexity for $m$ model lines and $n$ image lines), robust to occlusion and clutter, and does not make binary correspondence decisions until a locally optimal pose is found.

Given an approximate initial pose $T_0 = \{A_0, t_0\}$, we wish to find a 2-D affine transformation $T = \{A, t\}$ that maximizes the number of matches between model and image lines so that the distance between matched lines does not exceed a threshold $\delta_{\rm final}$. For a transformation $T$ and a line $l$, let us denote by $T(l)$ the transformation of $l$ by $T$. We assume that our initial pose $T_0$ is accurate enough so that the pose refinement algorithm does not have to consider all possible correspondences between model lines $l$ and image lines $l'$ but only those correspondences where $d(T(l), l') < \delta_0$; here, $d()$ is the distance function defined in section 5 and $\delta_0$ is an initial distance threshold that allows any reasonably close correspondence. Limiting the correspondences in this way results in a significant increase in the speed of this step without affecting the final outcome. Let $\mathcal{I}' = \{ l' \in \mathcal{I} \mid \exists\, l \in \mathcal{M} \wedge d(T_0(l), l') < \delta_0 \}$ be the subset of image lines that are initially reasonably close to any transformed model line.

Given $m$ model lines $\mathcal{M} = \{l_j, j = 1, \ldots, m\}$, $n$ image lines $\mathcal{I}' = \{l'_k, k = 1, \ldots, n\}$, and an approximate model pose $T_0 = \{A_0, t_0\}$, we wish to find the 2-D affine transformation $T$ and the $(m + 1) \times (n + 1)$ match matrix $M$ that minimize the objective function

$$E = \sum_{j=1}^{m} \sum_{k=1}^{n} M_{jk} \left( d(T(l_j), l'_k)^2 - \delta^2 \right). \tag{7}$$

$M$ defines the correspondences between model lines and image lines; it has one row for each of the $m$ model lines and one column for each of the $n$ image lines. This matrix must satisfy the constraint that each model line match at most one image line and vice versa.

By adding an extra row and column to $M$, slack row $m + 1$ and slack column $n + 1$, these constraints can be expressed as $M_{jk} \in \{0, 1\}$ for $1 \le j \le m + 1$ and $1 \le k \le n + 1$, $\sum_{k=1}^{n+1} M_{jk} = 1$ for $1 \le j \le m$, and $\sum_{j=1}^{m+1} M_{jk} = 1$ for $1 \le k \le n$. A value of 1 in the slack column $n + 1$ at row $j$ indicates that the $j$th model line does not match any image line. A value of 1 in the slack row $m + 1$ at column $k$ indicates that the $k$th image line does not match any model line. The objective function $E$ in equation 7 is minimized by maximizing the number of correspondences $l_j \leftrightarrow l'_k$ where $d(T(l_j), l'_k) < \delta_{\rm final}$.

Optimizing the objective function in equation 7 as a function of $M$ and $T$ is difficult because it requires a minimization subject to the constraint that the match matrix be a zero-one matrix whose rows and columns each sum to one. A typical nonlinear constrained optimization problem minimizes an objective function on a feasible region that is defined by equality and inequality constraints. The zero-one constraint on the match matrix is impossible to express with equality and inequality constraints. The graduated assignment algorithm developed by Gold and Rangarajan (10,11) can efficiently optimize our objective function subject to these constraints. This algorithm uses deterministic annealing to convert a discrete problem (for a binary match matrix) into a continuous one that is indexed by the control parameter $\beta$. The parameter $\beta$ ($\beta > 0$) determines the uncertainty of the match matrix, and thus the amount of smoothing implicitly applied to the objective function. The match matrix minimizing the objective function is tracked as this control parameter is slowly adjusted to force the continuous match matrix closer and closer to a binary match matrix. This has two advantages. First, it allows solutions to the simpler continuous problem to slowly transform into a solution to the discrete problem. Second, many local minima are avoided if the objective function that is minimized is highly smoothed during the early phases of the optimization but gradually transforms into the original objective function and constraints at the end of the optimization.


We minimize the objective function by first computing the variables $M_{jk}$ that minimize $E$, assuming that the transformation $T$ is fixed, and then computing the transformation $T$ that minimizes $E$, assuming that the $M_{jk}$ are fixed. This process is repeated until these estimates converge. For a fixed transformation $T$, the continuous match matrix $M$ is initialized by

$$M^0_{jk} = \begin{cases} 1 & \text{if } j = m + 1 \text{ or } k = n + 1, \\ \exp\left[ -\beta \left( d(T(l_j), l'_k)^2 - \delta^2 \right) \right] & \text{otherwise,} \end{cases} \tag{8}$$


where $\delta$ varies between $\delta_0$ at the start of the optimization and $\delta_{\rm final}$ at the end. Note that $\delta$ determines how distant two lines can be before the correspondence becomes undesirable:

$M^0_{jk} < 1$ when $d(T(l_j), l'_k) > \delta$,

$M^0_{jk} = 1$ when $d(T(l_j), l'_k) = \delta$,

$M^0_{jk} > 1$ when $d(T(l_j), l'_k) < \delta$.

So, for example, when $d(T(l_j), l'_k) > \delta$, $M^0_{jk}$ will be given a value less than the initial slack values of 1 for row $j$ and column $k$, thus initially making assignment to slack preferred over the assignment of model line $j$ to image line $k$. Next, we enforce the match constraints by applying to $M^0$ the Sinkhorn algorithm (21) of repeated row and column normalizations:


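repeat

    $M^{i+1}_{jk} = M^{i}_{jk} \big/ \sum_{k'=1}^{n+1} M^{i}_{jk'}$,  $1 \le j \le m$, $1 \le k \le n + 1$.

    $M^{i+1}_{jk} = M^{i+1}_{jk} \big/ \sum_{j'=1}^{m+1} M^{i+1}_{j'k}$,  $1 \le j \le m + 1$, $1 \le k \le n$.

until $\| M^{i+1} - M^{i} \|$ small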

Sinkhorn showed that when each row and column of a square matrix is normalized several times by the sum of the elements of that row or column, respectively (alternating between row and column normalizations), the resulting matrix converges to one that has positive elements with all rows and columns summing to 1, in other words, a probability distribution. However, this is only approximate for a non-square matrix such as ours: the rows or the columns will sum to one, but both will not. When $\beta$ is small, all elements of $M^0$ will be close to the neutral value of 1; this represents a high degree of uncertainty in the correspondences. As $\beta$ increases (and presumably the accuracy of the pose as well), the uncertainty in the correspondences decreases and the elements of $M^0$ move toward the values of 0 or $\infty$. Thus, the match matrix approximates a continuous probability distribution when $\beta$ is small, and ends as a binary correspondence matrix when $\beta$ is large. Appendix B describes changes that we have made in the Sinkhorn algorithm that often result in improved convergence of the graduated assignment algorithm to the local optima.
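
The initialization of equation 8 and the alternating normalization can be sketched in Python as below. This is the plain Sinkhorn iteration with slack entries, not the modified version of appendix B; the function names, the tolerance, and the iteration cap are assumptions made for illustration, and `dist[j][k]` is assumed to already hold $d(T(l_j), l'_k)$.

```python
import math

def initial_match_matrix(dist, beta, delta):
    """Equation 8: (m+1) x (n+1) matrix whose slack row and column are set to 1."""
    m, n = len(dist), len(dist[0])
    M = [[math.exp(-beta * (dist[j][k] ** 2 - delta ** 2)) for k in range(n)] + [1.0]
         for j in range(m)]
    M.append([1.0] * (n + 1))                       # slack row m+1
    return M

def sinkhorn(M, tol=1e-6, max_iter=100):
    """Alternate row and column normalization; the slack row and slack column
    take part in the sums but are never normalized themselves."""
    m, n = len(M) - 1, len(M[0]) - 1
    M = [row[:] for row in M]
    for _ in range(max_iter):
        prev = [row[:] for row in M]
        for j in range(m):                          # rows 1..m over columns 1..n+1
            s = sum(M[j])
            M[j] = [v / s for v in M[j]]
        for k in range(n):                          # columns 1..n over rows 1..m+1
            s = sum(M[j][k] for j in range(m + 1))
            for j in range(m + 1):
                M[j][k] /= s
        change = max(abs(M[j][k] - prev[j][k])
                     for j in range(m + 1) for k in range(n + 1))
        if change < tol:
            break
    return M
```

Entries for line pairs closer than $\delta$ start above 1 and so dominate their row and column after normalization, while distant pairs lose out to the slack entries.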


We also need to compute the affine transformation $T$ that minimizes the objective function $E$, assuming that the continuous-valued match matrix $M$ is held constant. This is difficult to do directly because of the complex nonlinear relation between $T$ and the nearest neighbor distance function $d$. Instead, we replace $d$ with a new distance, $d'$, whose square is the sum of the squared distances of the end points of an image line to the infinitely extended model line. For a model line $l_j$ and an image line $l'_k$, the new squared distance between $T(l_j)$ and $l'_k$ is

$$d'(T(l_j), l'_k)^2 = \sum_{i=1}^{2} \left( T(n_j) \cdot \left( p'_{ki} - T(p_{j1}) \right) \right)^2,$$

where $p'_{k1}$ and $p'_{k2}$ are the two end points of $l'_k$ and where $T(n_j)$, $T(p_{j1})$, and $T(p_{j2})$ denote the normal and two end points of $T(l_j)$, respectively (either end point of $T(l_j)$ may serve as the reference point, since both lie on the extended line). The new objective function is

$$E' = \sum_{j=1}^{m} \sum_{k=1}^{n} M_{jk} \left( d'(T(l_j), l'_k)^2 - \delta^2 \right).$$

In general, the transformation $T$ that minimizes $E'$ is not guaranteed to minimize $E$. In practice, however, because three line correspondences define a 2-D affine transformation, one would expect $E$ and $E'$ to have approximately the same minimizers whenever the model has three or more lines in a non-degenerate configuration. Since the expression for $d'(T(l_j), l'_k)^2$ involves rotating a vector and transforming a point, it is actually easier to reverse the roles of $l_j$ and $l'_k$ and minimize $E'$ by computing the inverse transformation $T' = \{A', t'\}$ that maps image lines into the frame of reference of the model. This way, the model normal vectors are constants and $T'$ is applied only to image points. Then, $T$ is computed as the inverse of $T'$. The objective function that is minimized to determine $T'$ is

$$E'' = \sum_{j=1}^{m} \sum_{k=1}^{n} \sum_{i=1}^{2} M_{jk} \left( \left( n_j \cdot \left( A' p'_{ki} + t' - p_{j1} \right) \right)^2 - \delta^2 \right), \tag{9}$$

where $n_j$ is the unit normal vector of model line $l_j$, $p_{j1} = [x_{j1}, y_{j1}]$ is one end point of model line $l_j$, and $p'_{ki} = [x'_{ki}, y'_{ki}]$ is the $i$th end point ($i = 1, 2$) of image line $l'_k$. We can find the transformation

$$T' = \left\{ A' = \begin{bmatrix} a' & b' \\ c' & d' \end{bmatrix},\ t' = \begin{bmatrix} t'_x \\ t'_y \end{bmatrix} \right\}$$

that minimizes equation 9 by solving the system of six equations

$$\frac{\partial E''}{\partial a'} = 0, \quad \frac{\partial E''}{\partial b'} = 0, \quad \frac{\partial E''}{\partial c'} = 0, \quad \frac{\partial E''}{\partial d'} = 0, \quad \frac{\partial E''}{\partial t'_x} = 0, \quad \frac{\partial E''}{\partial t'_y} = 0 \tag{10}$$

for $a'$, $b'$, $c'$, $d'$, $t'_x$, and $t'_y$. The solution to this system of equations is given in appendix C.
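
Because each residual in equation 9 is linear in the six unknowns, the system in equation 10 reduces to a $6 \times 6$ set of normal equations. The sketch below shows one way to assemble and solve it in plain Python; it is not the closed-form solution of appendix C. The function names are hypothetical, lines are assumed to be pairs of end points, and the $\delta^2$ term is dropped because it does not depend on $T'$.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def fit_inverse_affine(model_lines, image_lines, M):
    """Return [a', b', c', d', tx', ty'] minimizing the weighted squared distances
    of transformed image end points to the infinitely extended model lines."""
    N = [[0.0] * 6 for _ in range(6)]
    rhs = [0.0] * 6
    for j, ((x1, y1), (x2, y2)) in enumerate(model_lines):
        length = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        nx, ny = -(y2 - y1) / length, (x2 - x1) / length    # unit normal n_j
        h = nx * x1 + ny * y1                                # n_j . p_j1
        for k, endpoints in enumerate(image_lines):
            wgt = M[j][k]
            if wgt == 0.0:
                continue
            for (px, py) in endpoints:
                # residual = g . [a', b', c', d', tx', ty'] - h
                g = [nx * px, nx * py, ny * px, ny * py, nx, ny]
                for r in range(6):
                    rhs[r] += wgt * g[r] * h
                    for c in range(6):
                        N[r][c] += wgt * g[r] * g[c]
    return solve_linear(N, rhs)
```

The system is only well posed when the weighted correspondences involve at least three lines in a non-degenerate configuration, which matches the condition stated above.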


Figure 7 compares the values of $E$ and $E'$ over a typical application of the graduated assignment algorithm. Pseudocode for the pose refinement and verification algorithm is shown in figure 8. Figure 9 shows an example of how graduated assignment transforms an initial approximate pose into a more accurate pose.


Figure 7. Comparison of the two objective functions for a typical minimization by the graduated assignment algorithm. (The solid line is $E$, which uses the Euclidean distances to the nearest neighbor, and the dotted line is $E'$, which uses the sum of the distances of the image line endpoints to the infinitely extended model lines. The values of $E$ and $E'$ generally decrease during the optimization process, but they can also rise because of changes in the assignment variables $M_{jk}$.)


initialize: $T = T_0$, $\beta = \beta_0$, $\delta = \delta_0$, $\epsilon = \infty$, $k = 0$, maxsteps = 30.

while $k <$ maxsteps and $\epsilon > \epsilon_{\rm halt}$ do
    Initialize $M^0$ according to equation 8.
    Apply Sinkhorn's algorithm to $M^0$ to produce $M$.
    Compute $T'_k$ via equation 10.
    Compute $T_k$ as the inverse of $T'_k$.
    $\epsilon = \max |T_k - T_{k-1}|$.
    $k = k + 1$.
    $\delta = \delta + (\delta_{\rm final} - \delta_0) /$ maxsteps.
    $\beta = \beta_{\rm update} \times \beta$.
end while


Figure 8. The pose refinement algorithm.


Figure 9. Pose refinement using the graduated assignment algorithm: initial pose (left) and final pose after applying the algorithm (right).


7.  Experiments 


To validate our approach, we recognized partially occluded 2-D and 3-D objects in cluttered environments under a wide variety of viewpoints. All images were acquired at a resolution of 800 x 600 pixels; 400 to 800 lines were typically detected in an image, and each model had between 20 and 80 lines. First, we used books to test the recognition of planar objects. Figure 10 illustrates recognition results when our algorithm is applied to an image of a pile of books. For all but one of the five recognized books, the pose hypothesis leading to correct recognition was found in the top 10 hypotheses of the sorted hypothesis list. One book ("Linux," shown in the lower right of figure 10a) was not found until the 24th pose hypothesis. This book might be more difficult for our approach to recognize because a large part of its cover depicts a horse with curved borders, for which detected lines were inconsistent in images acquired from different viewpoints.

Figure 10. Five books, some partially occluded, are recognized in a cluttered environment.

The performance of our algorithm depends on how reliably it can move those pose hypotheses associated with correct correspondences to the top of the sorted hypothesis list. To evaluate this, we estimate $P_\theta(k)$, the probability that one of the first $k$ sorted pose hypotheses for a model leads to a correct recognition when the viewpoint of the recognized object and the viewpoint used to generate its model differ by an angle of $\theta$, assuming that an instance of the model does appear in the image. Knowing $P_\theta(k)$ allows one to determine how many pose hypotheses should be examined by the pose refinement process before restarting with a new model, either of a new object or of the same object but from a different viewpoint. Because $P_\theta(k)$ is highly dependent on the amount and type of clutter and occlusion in an image and because the level of clutter and occlusion present in our test was held fixed, $P_\theta(k)$ should be interpreted loosely. The six books shown in figure 10a were used to perform this experiment. All six books were placed flat on a table along with a number of other objects for clutter. Each book in turn was moved to the center of the table and then rotated on the plane of the table to eight different orientations, where each orientation was separated by approximately 45 degrees. For each of these orientations, the camera was positioned at angles 0, 10, 20, 30, 40, 50, 60, and 70 degrees relative to the normal of the table (0 degrees is directly overhead) and at a fixed distance from the center book, and then an image was acquired. The center books were unoccluded in these experiments. A separate image was also acquired of each book in an arbitrary orientation with the camera in the 0-degree position; these latter images were used to generate the book models. Figure 11 shows an image of the books on this table for the camera in the 50-degree position. We then applied our algorithm to each model and image pair and determined the position in the sorted hypothesis list of the first hypothesis that allowed the object to be recognized. As many as 100 hypotheses were examined for each model and image pair. The estimated values of $P_\theta(k)$ are shown in figure 12. From this we see that for planar objects whose orientations differ by as much as 60 degrees from the modeled orientation, a probability of correct recognition of 0.8 can be achieved if we examine the first 30 pose hypotheses. By examining just the top four pose hypotheses, we can achieve a probability of correct recognition of 1.0 for objects whose orientations differ by as much as 40 degrees from the modeled orientations. Thus, a good strategy would be to apply the algorithm with a set of models for each object generated for every 40-degree change in viewpoint; in this case, it would be sufficient to represent planar objects by five models in order to recognize all orientations of up to 80 degrees from the normal.

Figure 11. Image of a table of books taken by a camera tilted 50 degrees from vertical.
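
The estimate of $P_\theta(k)$ described above is an empirical cumulative frequency over the recorded ranks. A minimal Python sketch follows; the function names are hypothetical, and the input is assumed to be, for one viewpoint difference $\theta$, the 1-based rank of the first hypothesis that led to recognition in each trial (or `None` when none of the examined hypotheses did).

```python
def estimate_p_theta(first_correct_ranks, max_k):
    """Empirical P(k) for k = 1..max_k: fraction of trials whose first correct
    hypothesis has rank <= k."""
    trials = len(first_correct_ranks)
    return [sum(1 for r in first_correct_ranks if r is not None and r <= k) / trials
            for k in range(1, max_k + 1)]

def hypotheses_needed(p, target):
    """Smallest k with P(k) >= target, or None if the target is never reached."""
    for k, value in enumerate(p, start=1):
        if value >= target:
            return k
    return None
```

`hypotheses_needed` gives the number of pose hypotheses to pass to pose refinement before switching to another model, for a chosen target probability.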

Finally, we applied our algorithm to three 3-D objects. We acquired 17 images of each object, where the objects were rotated by 2.5 degrees between successive images. The first image of each object was used to represent the object, and from this, we generated a 2-D model by identifying the object edges in that image. With only the top 10 sorted pose hypotheses examined, all three objects were successfully recognized from all viewpoints that differed by as much as 25 degrees from the modeled viewpoint. Two of the objects (the monitor and hole punch) were also recognized at 30 degrees away from the modeled viewpoints. Figure 13 shows the object models, the images, the detected lines, and the final poses of the recognized objects for the most distant viewpoints from which each was recognized. The range of recognizable object orientations could have been extended somewhat if more pose hypotheses had been examined, but at some point, it becomes more cost effective to add a new model for a different viewpoint.

Figure 12. $P_\theta(k)$ is the probability that one of the first $k$ sorted pose hypotheses for a model leads to a correct recognition for that model. ($\theta$ is the difference in viewpoint elevation between the model and the object. For $\theta \le 40°$, one of the four highest ranked pose hypotheses always leads to correct recognition. The curves for $\theta = 0$ through 40 degrees are superposed for $k \ge 4$.)

Figure 13. Recognition of 3-D objects from viewpoint-dependent 2-D models: computer monitor (top row), stapler (middle row), and hole punch (bottom row). (Shown in each row, from left to right, is the 2-D object model, original image, image lines, and model of recognized object overlaid on the original image. The modeled view of each object differs from the test view by 20 to 30 degrees.)

The minimum number of 2-D models needed to represent a 2-D or 3-D object can be determined from the maximum difference between the object's orientation and a modeled orientation that still allows the object to be recognized from that 2-D model. Let this maximum difference between object and model orientation be denoted by $\Theta$. To be conservative, based on the results described above, we take $\Theta = 40$ degrees for 2-D objects, and $\Theta = 20$ degrees for 3-D objects. We then determine the minimum number of 2-D models needed to represent an object by counting the number of identical right-circular cones of angle $\Theta$ that are needed to fully enclose the upper hemisphere (for the case of 2-D objects) or the entire sphere (for the case of 3-D objects) when the vertices of the cones are placed at the sphere's center. (The angle of a right circular cone is the angle around the vertex of the cone between the cone's axis and the conic surface.) One can determine that six cones of angle 40 degrees fully enclose the upper hemisphere while 44 cones of angle 20 degrees fully enclose the entire sphere. Thus, recognizing 2-D objects from any viewpoint above the plane of the object requires at most six 2-D models, while recognizing 3-D objects from any viewpoint (above or below the ground plane) requires at most 44 2-D models.
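
As a sanity check on these counts, an area argument gives a lower bound on how many cones any covering needs: a cone of angle $\Theta$ cuts a cap of area $2\pi(1 - \cos\Theta)$ out of the unit sphere, and the caps must together have at least the area of the region they cover. The Python sketch below (a hypothetical helper, not part of the report) computes that bound; it does not construct a covering, so it can only confirm that six and 44 are not below the minimum possible.

```python
import math

def cap_count_lower_bound(cone_angle_deg, region="hemisphere"):
    """Fewest cones of the given angle that could cover the region, by area alone."""
    cap_area = 2.0 * math.pi * (1.0 - math.cos(math.radians(cone_angle_deg)))  # unit sphere
    region_area = 2.0 * math.pi if region == "hemisphere" else 4.0 * math.pi
    return math.ceil(region_area / cap_area)
```

For 40-degree cones over the hemisphere the bound is 5, and for 20-degree cones over the whole sphere it is 34, so the reported coverings of six and 44 cones are consistent with it; the gap reflects the overlap that any real covering needs.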

From these experiments, it is apparent that only a small number of pose hypotheses need to be examined by the pose refinement algorithm in order to reliably recognize objects. We use this to determine the overall run time complexity of our algorithm. Assume that we have $q$ models, each containing $m$ lines, and that the image contains $n$ lines. Initialization of the nearest neighbor data structure and identification of corners can be performed in $O(n \log n)$ time. For each model, we generate $O(mn)$ pose hypotheses. The neighborhood similarity of each of these can be evaluated in $O(\log n)$ time. The pose hypotheses can be sorted in time $O(mn \log(mn))$. The pose refinement algorithm requires $O(mn)$ time.

Thus, the overall run time complexity of our algorithm is $O(qmn \log(mn))$.


8.  Conclusions 


We have presented an efficient approach to recognizing partially occluded objects in cluttered environments. Our approach improves on previous approaches by employing information available in one or two line correspondences to compute approximate object poses. Only a few model lines need to be unfragmented in an image in order for our approach to be successful; this condition is easily satisfied in most environments. The use of one or two line correspondences to compute an object's pose allows for a large reduction in the dimensionality of the space that must be searched in order to find a correct pose. We then developed an efficiently computed measure of the similarity of two line neighborhoods that is largely unaffected by clutter and occlusion. This provides a way to sort the approximate model poses so that only a small number need to be examined by more time-consuming algorithms. Experiments show that a single view of an object is sufficient to build a model that will allow recognition of that object over a wide range of viewpoints.


A. Line Detection


Line segments in models are matched to line segments in images. Each line segment in a model or image is represented by its two end points. Generation of model lines may be performed manually by the user or automatically by the application of image processing to images of the objects to be recognized. We currently use publicly available software (14) to automatically locate model and image lines. Briefly, this software operates as follows. The Canny edge detector is first applied. Next, contiguous edge pixels are linked together into contours and very short contours are discarded. Each contour is then partitioned into line segments by breaking the contour at edge pixels until no edge pixel is more than a specified distance from the line connecting the two end points of its subcontour; this is done by finding the longest subcontour (starting at the first edge point) whose maximum distance from the line connecting the end points of the subcontour is less than a threshold. This subcontour is replaced by the line segment, and then the process is repeated for the remainder of the contour.
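
The contour partitioning step can be sketched in Python as follows. This is not the cited software: the function names are hypothetical, a contour is assumed to be an ordered list of pixel coordinates, and the sketch grows each subcontour greedily and stops at the first pixel that violates the threshold, which is a simplification of searching for the longest admissible subcontour. Canny edge detection, contour linking, and the removal of short contours are assumed to have been done already.

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length

def contour_to_segments(contour, threshold):
    """Replace successive subcontours by line segments (pairs of end points)."""
    segments = []
    start = 0
    while start < len(contour) - 1:
        end = start + 1
        # Extend the subcontour while every interior pixel stays within
        # threshold of the chord joining its two end points.
        while end + 1 < len(contour) and all(
                point_line_distance(contour[i], contour[start], contour[end + 1]) < threshold
                for i in range(start + 1, end + 1)):
            end += 1
        segments.append((contour[start], contour[end]))
        start = end                                  # continue with the remainder
    return segments
```

On an L-shaped contour the procedure yields one segment per arm, with the break placed at the corner pixel.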

In our experience, for images with dense edges, this approach to line detection performs better than the Hough Transform approach (8). The high connectivity of the edges produced by the Canny edge detector greatly simplifies the process of fitting lines to those contours when the contour partitioning approach is used. Line fitting with the Hough Transform, on the other hand, is easily confounded by spurious peaks generated by coincidental alignment of physically separated edge points (12).

A requirement of our approach, as stated in section 1, is to detect at least one
unfragmented image line segment. An evaluation of the accuracy of our line detector shows
that this requirement is easily satisfied for the types of scenes described in this report.
Using six different images of books and office objects (as typified by images shown
throughout this paper), we manually measured the lengths of 250 projected model lines and
the lengths of the corresponding automatically detected line segments. All model edges
that were partially or fully visible were measured. If a visible model edge was not detected
by our software, then the “corresponding line segment” was assigned a length of zero. For
each model edge, the relative error in the length of the corresponding detected line segment
is calculated as $|(l_m - l_i)/l_m|$, where $l_m$ is the length of the projected model line and
$l_i$ is the length of the detected line segment. Figure A.1 shows a plot of the relative error
versus the fraction of the 250 model lines that are detected with relative error no greater
than that amount. One can see that 11% of all partially and fully visible model lines are
detected in the images with less than one pixel error in the positions of their end points.
Furthermore, 35% of all model lines are detected as image segments where the sum of the
errors in the endpoint positions is no more than 5% of the length of the corresponding
projected model lines; that is, 35% of the visible model lines are detected with relative
error in length that is less than 5%. We find that 5% relative error is small enough to
obtain a good coarse pose hypothesis, and that with 35% of the model lines having relative
errors no larger than this, there will be many such good hypotheses that will allow the pose
refinement stage of the algorithm to recognize an object.


Figure A.1. The accuracy of our line detector is depicted in this graph, which plots the relative error
versus the fraction of the model lines detected with relative error no greater than that
amount.
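
The relative length error and the cumulative fraction plotted in Figure A.1 can be computed as in the sketch below. The function names are hypothetical and the four line lengths are made-up values chosen for illustration; they are not the 250 measurements reported above.

```python
import numpy as np

def relative_length_errors(model_lengths, detected_lengths):
    """Relative error |(l_m - l_i) / l_m| for each visible model line.

    An undetected model line is entered with a detected length of zero,
    which gives it a relative error of 1.
    """
    l_m = np.asarray(model_lengths, dtype=float)
    l_i = np.asarray(detected_lengths, dtype=float)
    return np.abs((l_m - l_i) / l_m)

def fraction_within(errors, max_error):
    """Fraction of model lines detected with relative error <= max_error."""
    return float(np.mean(np.asarray(errors) <= max_error))

# Hypothetical lengths in pixels; the third model line was not detected.
model_lengths = [100.0, 80.0, 50.0, 40.0]
detected_lengths = [98.0, 80.0, 0.0, 30.0]

errors = relative_length_errors(model_lengths, detected_lengths)
frac_5_percent = fraction_within(errors, 0.05)
```

Evaluating `fraction_within` over a range of error levels gives the curve of the kind shown in the figure.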
B. Modifications of the Sinkhorn Algorithm


Sinkhorn’s original algorithm [21] treats all rows and columns identically. The modified
Sinkhorn algorithm discussed in section 6 must treat the slack row and column differently
from other rows and columns: the slack values are not normalized with respect to other
slack values, only with respect to the nonslack values. This is necessary in order to allow
multiple image lines to be identified as clutter and to allow multiple model lines to be
identified as occluded. A problem with this modified algorithm is the following. Suppose
that a nonslack value is a maximum in both its row and column. After that row is
normalized, it is possible that this previously maximal value is now less than the slack
value for that column. The same sort of thing can happen when columns are normalized.
Intuitively, this behavior is undesirable: nonslack values that begin maximal in both their
row and column should remain maximal in their row and column throughout the Sinkhorn
iteration. The purpose of Sinkhorn normalization is not to shift assignment weights around
but only to normalize the assignments so that they approximate a probability distribution.
A secondary problem with the Sinkhorn algorithm is that the order of normalization (row
first or column first) can have a significant effect on the final normalization, especially
when there is potential for “weight shifting” as described above.
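
A small numerical example may make the weight-shifting problem concrete. The sketch below assumes that the last row and last column of the assignment matrix hold the slack values and that a row normalization rescales each nonslack row (including its slack-column entry) to sum to one while leaving the slack row untouched; the matrix values and the function name are invented for illustration.

```python
import numpy as np

def normalize_rows_with_slack(M):
    """Rescale every nonslack row to sum to one; the slack row (last row)
    is left unchanged, since slack values are not normalized against
    other slack values."""
    M = np.asarray(M, dtype=float).copy()
    M[:-1, :] /= M[:-1, :].sum(axis=1, keepdims=True)
    return M

# Hypothetical 2x2 assignment matrix with an added slack column and slack row.
M = np.array([[2.0, 1.0, 1.0],
              [0.5, 1.0, 1.0],
              [1.0, 1.0, 1.0]])

R = normalize_rows_with_slack(M)

# Before: entry (0, 0) is the maximum of its row and of its column.
print(M[0, 0], M[2, 0])   # 2.0 1.0
# After: the same entry is smaller than the slack value of its column.
print(R[0, 0], R[2, 0])   # 0.5 1.0
```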

To minimize weight shifting, after performing row normalizations we set the values in the
slack row so that their ratio to the nonslack value in each column that was previously the
maximum is the same as this ratio before row normalization. A similar thing is done after
column normalizations. In addition, to eliminate the effect of normalization order, rows
and columns are normalized independently on each step of the iteration and then the two
normalized matrices are combined into one. The pseudocode for this new Sinkhorn
algorithm is shown in Figure B.1.
Figure B.1. The new Sinkhorn algorithm.
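
The following is a minimal sketch of one possible reading of the procedure described in this appendix; it is not the pseudocode of Figure B.1. Several details are assumptions: the last row and column hold the slack values, all entries are strictly positive, the "previously maximum" value of a column (or row) is taken to be its largest nonslack entry before normalization, and the two normalized matrices are combined by an element-wise average. The function name, iteration limit, and tolerance are hypothetical.

```python
import numpy as np

def modified_sinkhorn(M, n_iter=50, tol=1e-6):
    """Sinkhorn-style normalization with ratio-preserving slack updates.

    M has shape (J+1, K+1); row J and column K are the slack row and
    slack column. Returns the matrix after the final iteration.
    """
    M = np.asarray(M, dtype=float).copy()
    J, K = M.shape[0] - 1, M.shape[1] - 1
    rows, cols = np.arange(J), np.arange(K)
    for _ in range(n_iter):
        prev = M.copy()
        # Largest nonslack entry of each column and of each row.
        col_arg = np.argmax(M[:J, :K], axis=0)
        row_arg = np.argmax(M[:J, :K], axis=1)
        # Slack-to-maximum ratios before normalization.
        slack_row_ratio = M[J, :K] / M[col_arg, cols]
        slack_col_ratio = M[:J, K] / M[rows, row_arg]

        # Row normalization of the nonslack rows, then restore the
        # slack-row ratios against the previously maximal column entries.
        R = M.copy()
        R[:J, :] /= R[:J, :].sum(axis=1, keepdims=True)
        R[J, :K] = slack_row_ratio * R[col_arg, cols]

        # Column normalization of the nonslack columns, computed
        # independently from the same input, then restore the
        # slack-column ratios.
        C = M.copy()
        C[:, :K] /= C[:, :K].sum(axis=0, keepdims=True)
        C[:J, K] = slack_col_ratio * C[rows, row_arg]

        # Assumed combination rule: element-wise average.
        M = 0.5 * (R + C)
        if np.abs(M - prev).max() < tol:
            break
    return M

# Hypothetical 2x2 assignment matrix with slack column and slack row.
M0 = np.array([[2.0, 1.0, 1.0],
               [0.5, 1.0, 1.0],
               [1.0, 1.0, 1.0]])
M1 = modified_sinkhorn(M0, n_iter=1)
```

With the averaging rule assumed here, one iteration leaves each slack-to-maximum ratio exactly unchanged, so an entry that starts above the slack values of its row and column stays above them after that step. No convergence guarantee is claimed for this sketch.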
C. Solving for the Affine Transformation


We obtain the affine transformation that minimizes equation 9 by solving the system given
in equation 10. Expanding equation 10, we obtain the linear system $Ax = b$ where
$x = (a', b', c', d', t'_x, t'_y)^T$, $A$ is the $6 \times 6$ symmetric matrix


[Matrix $A$: the full layout is not recoverable from this copy. The legible entries are sums of
end-point terms of the forms $x'_{k1} + x'_{k2}$, $y'_{k1} + y'_{k2}$, and
$x'_{k1} y'_{k1} + x'_{k2} y'_{k2}$.]

and $b$ is the column 6-vector given by

The unknown $x$ is easily found via standard linear methods.
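
As an illustration of how a $6 \times 6$ symmetric system of this kind arises and is solved, the sketch below fits an affine transformation to weighted point correspondences by forming and solving the normal equations. The objective used here (a weighted sum of squared distances between transformed source points and target points) is an assumed stand-in for equation 9, and the function name, argument names, and example coordinates are hypothetical. The unknowns are ordered as $(a, b, c, d, t_x, t_y)$.

```python
import numpy as np

def fit_affine(src, dst, weights=None):
    """Weighted least-squares affine fit via the normal equations.

    Finds (a, b, c, d, tx, ty) minimizing
        sum_i w_i * ||[[a, b], [c, d]] @ src_i + (tx, ty) - dst_i||^2.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)

    # Two design-matrix rows per correspondence: one for x, one for y.
    D = np.zeros((2 * n, 6))
    D[0::2, 0:2] = src   # a*x' + b*y'
    D[0::2, 4] = 1.0     # + tx
    D[1::2, 2:4] = src   # c*x' + d*y'
    D[1::2, 5] = 1.0     # + ty
    y = dst.reshape(-1)  # interleaved x0, y0, x1, y1, ...
    W = np.repeat(w, 2)  # same weight for both coordinates of a point

    A = D.T @ (W[:, None] * D)   # 6x6 symmetric matrix
    b = D.T @ (W * y)            # 6-vector
    return np.linalg.solve(A, b)

# Hypothetical check: recover a known transformation from four points.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
L = np.array([[1.1, -0.2], [0.3, 0.9]])
t = np.array([5.0, -2.0])
dst = src @ L.T + t
params = fit_affine(src, dst)
```

`np.linalg.solve` requires a nonsingular matrix, so the source points must not all be collinear; `np.linalg.lstsq` on the design matrix is a more robust alternative when that cannot be guaranteed.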